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Abstract 

Many existing cohorts contain a range of relatedness between genotyped individuals, either by design or by chance. 
Haplotype estimation in such cohorts is a central step in many downstream analyses. Using genotypes from six cohorts from 
isolated populations and two cohorts from non-isolated populations, we have investigated the performance of different 
phasing methods designed for nominally 'unrelated' individuals. We find that SHAPEIT2 produces much lower switch error 
rates in all cohorts compared to other methods, including those designed specifically for isolated populations. In particular, 
when large amounts of IBD sharing is present, SHAPEIT2 infers close to perfect haplotypes. Based on these results we have 
developed a general strategy for phasing cohorts with any level of implicit or explicit relatedness between individuals. First 
SHAPEIT2 is run ignoring all explicit family information. We then apply a novel HMM method (duoHMM) to combine the 
SHAPEIT2 haplotypes with any family information to infer the inheritance pattern of each meiosis at all sites across each 
chromosome. This allows the correction of switch errors, detection of recombination events and genotyping errors. We 
show that the method detects numbers of recombination events that align very well with expectations based on genetic 
maps, and that it infers far fewer spurious recombination events than Merlin. The method can also detect genotyping errors 
and infer recombination events in otherwise uninformative families, such as trios and duos. The detected recombination 
events can be used in association scans for recombination phenotypes. The method provides a simple and unified approach 
to haplotype estimation, that will be of interest to researchers in the fields of human, animal and plant genetics. 
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Introduction 

The estimation of haplotypes from SNP genotypes, commonly 
referred to as 'phasing', is a well studied problem in the literature. 
Existing approaches can be categorised according to the type of 
cohort each method is designed to phase, and the level of 
relatedness between the individuals in that cohort. Much of the 
recent literature is devoted to phasing nominally unrelated (or 
distandy related) individuals. Currently, the most accurate 
methods use hidden Markov models (HMMs) to model local 
haplotype sharing between individuals [1,2], and take advantage 
of linkage disequilibrium (LD). Some of these methods can also 



handle mother-father-child trios and parent-child duos [2-6], for 
more complex pedigrees there are several general pedigree 
analysis software packages [7-10]. 

However such methods face several limitations; Lander-Green 
algorithm based approaches have computational and space 
complexity that scale exponentially with sample size; they can be 
sensitive to genotyping error and they can only phase sites where 
at least one member of the pedigree is not heterozygous. The last 
point is particularly crucial, as it means the haplotypes will not be 
'complete' and cannot be easily used in pre-phasing and 
imputation which is now a standard part GWAS pipelines [11]. 
If founders in these pedigrees have been sequenced with the aim of 
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Author Summary 

Every individual carries two copies of each chromosome 
(haplotypes), one from each of their parents, that consist 
of a long sequence of alleles. Modern genotyping 
technologies do not measure haplotypes directly, but 
the combined sum (or genotype) of alleles at each site. 
Statistical methods are needed to infer (or phase) the 
haplotypes from the observed genotypes. Haplotype 
estimation is a key first step of many disease and 
population genetic studies. Much recent work in this area 
has focused on phasing in cohorts of nominally unrelated 
individuals. So called 'long range phasing' is a relatively 
recent concept for phasing individuals with intermediate 
levels of relatedness, such as cohorts taken from popula- 
tion isolates. Methods also exist for phasing genotypes for 
individuals within explicit pedigrees. Whilst high quality 
phasing techniques are available for each of these 
demographic scenarios, to date, no single method is 
applicable to all three. In this paper, we present a general 
approach for phasing cohorts that contain any level of 
relatedness between the study individuals. We demon- 
strate high levels of accuracy in all demographic scenarios, 
as well as the ability to detect (Mendelian consistent) 
genotyping error and recombination events in duos and 
trios, the first method with such a capability. 

imputing sequenced variants from founders into descendants who 
have been assayed on microarrays, then a pedigree phasing 
method that overcomes these issues will be especially useful. 

The task of phasing in isolated populations is some what of a 
special case, as individuals from such populations exhibit much 
higher levels of relatedness, and will tend to share much longer 
stretches of sequence identically by descent (IBD) than a pair of 
unrelated individuals from a non-isolated population. Kong at 
al.(2008) [12] proposed a method in which surrogate parents are 
identified for each individual in a given region of the genome. 
These surrogate parents allow the haplotypes to be determined 
with high accuracy using Mendelian inheritance rules, effectively 
as if the true parents had been observed and the family could be 
phased as a trio. More recently, a model based version of this 
approach called Systematic Long Range Phasing (SLRP) has been 
proposed [13]. Both of these papers demonstrated accurate 
haplotype estimates within IBD regions, but suffer from the 
problem that phase can only be inferred for genomic regions 
where IBD sharing is detected. Even in IBD regions, if a site is 
heterozygous in all individuals, the phase at that particular locus 
cannot be inferred. 

So far in the literature there has been very little investigation of 
the performance of methods for phasing in isolated populations. In 
addition, many GWAS cohorts consist of a range of relatedness 
between the study individuals. Some cohorts contain mixtures of 
pedigrees, weakly or cryptically related individuals and more 
distandy related individuals. Methods for carrying out association 
studies using related individuals have recently been re-discovered 
in the literature as a powerful approach, with the additional 
benefit of implicitly avoiding confounding due to population 
structure [14-16]. In addition, exphcit detection of tracts of IBD 
between pairs of individuals is becoming more widely used for 
detection of disease genes [17-20] and for population genetic 
analyses [21,22]. More generally, isolated populations offer 
promise for interrogating common complex diseases [23]. For 
many such cohorts phasing will be a first step in performing 
imputation from a reference panel [1 1] or as part of an IBD 



detection analysis, so it is interesting to consider the performance 
of alternative phasing methods. 

We recently compared several methods all designed to phase 
nominaUy unrelated samples (SHAPEIT2 [2], SHAPEITl [5], 
Beagle [4], HAPI-UR [6], Impute2 [24], MaCH [25], fastPHASE 
[26]) and found that SHAPEIT2 was the most accurate method in 
this setting. In this paper we examine the performance of these 
methods at increasing levels of relatedness between individuals. To 
do this we used cohorts from six different isolated populations (and 
two additional cohorts from non-isolated populations). Each of 
these cohorts contain some extended pedigrees allowing us to 
assess performance on both nominally unrelated individuals and 
on explicitly related samples. 

For cohorts with explicitiy related samples we introduce a new 
hidden Markov model (which we call duoHMM) that can estimate 
the inheritance pattern between between the haplotypes of each 
parent-child duo. This method can be used to visualise the 
inheritance status across a chromosome, correct phasing errors 
that are inconsistent with pedigree information, and detect 
genotyping errors. We show that after applying this adjustment, 
SHAPEIT2's haplotypes are accurate enough that we can detect 
exphcit recombination events between parent-child pairs. Apply- 
ing this method to the SHAPEIT2 inferred haplotypes provides 
the most accurate performance in the extended pedigree setting. 
Using our method we are able to demonstrate that the 
recombination events that we infer from otherwise uninformative 
duos and trios can add power to association scans for recombi- 
nation phenotypes. Specifically, at the estabhshed PRDM9 locus 
we are able to show that including these extra recombination 
events increases the signal of association for a hot spot usage 
phenotype. Overall, the combination of SHAPEIT2 and 
duoHMM provides a very general method for accurate phasing 
of cohorts with any levels of implicit or explicit relatedness 
between individuals. 

SHAPEIT2 and duoHMM are available from the website: 
http: / / www.stats.ox.ac.uk/ marchini/software/gwas/ gwas.html 

Materials and Methods 

Real datasets 

To provide a comprehensive assessment of the accuracy of 
methods we analysed eight different cohorts that vary in the extent 
of the relatedness between individuals. The cohorts are summa- 
rised in Table S 1 , six of these are considered to be from isolated 
populations. The Orkney Complex Disease Study (ORCADES) is 
an ongoing study in the isolated Scottish archipelago of Orkney 
[27]. The CROATIA- VIS (Vis) and CROATIA-KORCULA 
(Korcula) studies contain individuals recruited from the Dalmation 
islands of Vis and Korcula [27,28]. The INGI-Val Borbera 
population is a collection of 1,664 genotyped individuals collected 
in the Val Borbera region, a geographically isolated valley located 
within the Appennine Mountains in Northwest Italy. The valley is 
inhabited by about 3,000 descendants from the original popula- 
tion, living in seven villages along the valley and in the mountains 
[29]. The INGI-FVG Cohort is a collection of six different isolated 
villages in the Friuli Venezia Giulia region of northern Italy [30] . 
The INGI-CARL cohort contains individuals from Carlantino, a 
small isolated village in the province of Foggia in southern Italy 
[30]. The CROATIA-Split (Split) cohort contains individuals from 
the Croatian city of Split [31]. Finally, a large sample from the 
Ugandan General Population Cohort (GPC) [32], covering 
residents of 25 villages in south-Western Uganda were analysed. 
These final two cohorts are not considered to be isolated and 
hence are useful as control samples of unrelated individuals. Each 
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of these cohorts contain pedigrees of varying sizes (see Table SI) 
which can be used to evaluate phasing accuracy. The GPC cohort 
was genotyped using the Dlumina Human OMNI 2.5S chip. AU 
the other cohorts were genotyped using either the lUumina 
HumanHapSOO or HumanCNV370 chips. 

In addition to quality control (QC) performed on each cohort 
by their respective research groups, we apphed stringent filters to 
remove genotypes inconsistent with pedigree structure. Firstly, we 
ran Pedstats [33] to detect any genotypes that violated Mendelian 
constraints, and these loci were marked as missing for all 
individuals in a pedigree where violations were found. Loci that 
produced Mendel violations for > 5% of samples were filtered for 
all individuals in a cohort. Secondly, Merlin's error detection 
cdgorithm was used on all pedigrees, and genotypes which were 
unlikely were also flagged as missing. This final set of genotypes 
were used as input in all subsequent analyses. 

AU software was run as per instructions in their respective 
manuals. The computation times for each method for the largi;st 
experiment conducted on European cohorts are summarised in 
Figure SI, a more comprehensive study of running times was 
conducted in the original SHAPEIT2 paper [2]. 

Creating validation haplotypes. We phased the pedigrees 
in each cohort using Merlin (version 1.1.2), which produces the 
most likely haplotypes given the pedigree structure using the 
Lander-Green algorithm. The phasing occurs one pedigree at a 
time and no information is shared between pedigrees. These 
haplotypes should be highly accurate and hence suitable for 
validation purposes. The accuracy of the haplotypes will increase 
with pedigree size, so in some of our experiments we only use those 
haplotypes from larger pedigrees. 

Merlin can only phase loci where at least one pedigree member 
is homozygous. We found that 50.16% of heterozygote sites could 
be phased for duos, 77.79°/) for trios and >99% sites for pedigrees 
of size > 8 (Figure S2). Running times for Merlin increase 
exponentially as pedigree size increases (see Figure S3) but in 
general were not excessive on these data sets (<1 hour per 
cohort). 

Haplotype accuracy in founder individuals. We merged 
the pedigree founders with individuals who were not in any 
explicit pedigree for each cohort, that is, all non-founders from 
pedigrees were excluded. This gave us a sample of (nominally) 
unrelated individuals that we phased using each of the methods. 
We applied SLRP (version 09fDf52), SHAPEIT2 (r613). Beagle 
(version 3.3.2) and HAPI-UR (version 1.01) to these founder data 
sets. We then calculated the switch error (SE) [34] of the haplotype 
estimates for the pedigree founders, treating the Merlin haplotypes 
as the truth. This evaluation pipeline is visualised in Figure S4. 

SLRP does not produce whole chromosome haplotypes. It only 
phases regions of the genome where IBD sharing is detected, and 
can only resolve the phase of heterozygous sites when at least one 
of the individuals sharing IBD is homozygous at those loci. This 
complicates the calculation of SE between methods. We treated 
each of the IBD segments inferred by SLRP separately and 
evaluated SE within these regions. We refer to this metric as the 
"within IBD SE", using it to evaluate SLRP's performance against 
methods that phase every site. We also calculated the SE of the 
SHAPEIT2, Beagle and HAPI-UR haplotypes across the whole of 
chromosome 10. We also report the yield of SLRP, defined as 
percentage of genotypes that are phased. 

Haplotype accuracy in explicitly related 
individuals. Individuals in pedigrees obviously share large 
amounts of their genome IBD. Algorithms that have the ability 
to exploit IBD sharing in distandy related individuals may also 
work well on explicitiy related individuals. Hence, we also 



evaluated the accuracy of SHAPEIT2, SLRP, HAPI-UR and 
Beagle apphed to the full cohorts described here, with the fuU 
extended pedigrees included. We ran each of the methods using no 

information regarding relatedness of samples. We calculated SE 
for each method on the haplotypes of any individual in a pedigree 
larger than a mother-father-child pedigree, using the Merlin 
haplotypes as truth. 

SHAPEIT2, Beagle and HAPI-UR all provide functionality to 
phase parent-child duos and mother-father-child trios, by 
constraining the possible haplotypes to those consistent with the 
transmitted and untransmitted haplotypes of each parent (the child 
having each of the parents' transmitted haplotypes). This approach 
win produce very accurate haplotypes although will return the 
recombined haplotypes for each parent, rather than the true 
parental haplotypes. Since only several recombinations occur per 
chromosome, this is not introducing a substantial amount of error 
in the context of pre-phasing/imputation but is obviously 
problematic for researchers wishing to study recombination. 

Larger pedigrees could be divided into subsets of duos and trios 
but often there will exist no subdivision that allows all samples to 
exploit a parental relationship. For example, a family with two 
parents and two siblings may be divided into two duos, but 
partitioning a nuclear family with three children means at least 
one child will be phased without using parental information. 
There is no obvious optimum way to partition pedigrees of 
arbitrary' size and structure. We investigated a simple method 
where we enumerate every possible partitioning of a pedigree into 
duos/ trios and choose the partition that minimises the number of 
individuals that are not included in a duo/trio (many partitions 
often share the same minimum in which case one is picked at 
random). We applied this partitioning to the datasets and then ran 
Beagle (since it was the next most competitive method) taking the 
implied duo and trio information into account. We refer to this as 
the Beagle duo/trio method. Beagle was found to use substantially 
more memory in this setting (over 150 GB for the GPC cohort) 
which may be problematic for some researchers. This issue is 
noted in the Beagle manual and relates to missing data in parent- 
offspring duos and trios. 

On duos and trios this method wiU agree perfectiy with Merlin 
at sites that Merlin can phase. This introduces a possible 
confounding effect when using the Merlin haplotypes as the truth, 
as any errors in the Merlin haplotypes will not be detectable when 
compared to the Beagle duo/trio method. We show below using 
simulated data that Merlin is quite sensitive to genotyping error 
and that this does result in elevated switch errors. For this reason 
we only consider pedigrees that are more complex than a parent- 
child duo or father-mother-child trio when comparing methods. 
Larger pedigrees also give Merlin better ability to remove 
genotyping errors yielding more accurate validation haplotypes. 

Using pedigree information to improve phase, infer 
recombination events and detect genotyping error 

The results below show that SHAPEIT2 can implicitiy leverage 
IBD sharing and hence phase a pedigree accurately without any 
relationships specified. Additional use of explicit relationships is 
likely to lead to even greater improvements. The Lander-Green 
algorithm is traditionally the method of choice for phasing 
pedigrees but has several limitations described previously. We 
developed a simple HMM applicable to the SHAPEIT2 haplo- 
types that corrects phasing errors that are inconsistent with 
pedigree information. The method focuses separately on each 
parent-child duo and this circumvents several issues with the 
Lander-Green algorithm, namely; 
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• complexity: our HMM has a constant number of hidden states 
(4) and possible transitions between states (16) per meiosis, so 
our method scales as 0(NL) where is the number of non- 
founders (the method runs on each parent separately) and L is 
the number of markers. This compares well to the 0(2'^ L) 
scaling for a naive Lander-Green implementation. 

• heterozygous markers: markers that are heterozygous throughout a 
pedigree will be phased via leveraging population haplotypes 

• sensitivity to genotyping error: the low computational complexity of 
the model allows us to accommodate genotype uncertainty 

We describe the model and several useful applications of it 
below. We refer to this framework as the duoHMM in later 
sections of the paper. 

Duo HMM. Let p = (pi,P2) and c = {c\,C2) denote a pair of 
observed (ordered) parental and child haplotypes respectively. 
Here pi = {pn ,.. . ,piL} denotes the rth parental haplotype at the L 
sites across a chromosome. The same notation is used for the rth 
child haplotype c,. There are 4 possible patterns of gene flow 
between the parent and child. The true pattern of gene flow wiU 
remain constant over long stretches of a chromosome due to the 
low rate of recombination in any given meiosis. We use Si to 
denote the pattern of gene flow at the /th locus, where S/ = (J,k) 
means that the parents y'th haplotype and the child's kth haplotype 
are identical by decent (IBD). Here j,k e {1,2}, so there are just 4 
possible inheritance patterns, which we denote 
A=i\,l),B = (2,l),C = {\,2),D = (2,2). The true inheritance states 
S = {S[, . . . ,Sl} are unobserved across each chromosome and we 
wish to infer them from our imperfect observations of the parental 
and child haplotypes p and c. The intuition behind our approach 
is that true recombination events and SEs will cause changes to the 
pattern of gene flow as we move along a chromosome. Since we 
expect just a few true recombination events the SEs wiU tend to 
dominate. Thus we can think of the observed pattern of gene flow 
as the superposition of two point processes: one with a low rate 
dictated by true recombinations, and a second process with a rate 
relating to SEs. Our aim is to deconvolve these two processes to 
detect the true recombination events and correct SEs. 

To carry out this inference we have developed an HMM that 
allows for SEs in the parental and child haplotypes. We use and 
to denote the probability of a SE on the parental or child 
haplotypes between two adjacent markers respectively. We also 
use Pi to denote the probability of a recombination occurring 
between markers / and Specifically we use pi = l—e^''' 

where r/ is the genetic distance between markers / and / + 1. We 
use the genetic distances from the HapMap LD based map [35] 
(which are inherently sex averaged) and scale them to the sex- 
specific genetic lengths from the deCODE 2002 map according to 
the sex of the parent. 

The initial states of the Markov model are given by P(Si ) = 1/4. 
The transition rates on the IBD states S are then given by 



P{S,+ i = v\S, = u) = 

{l-S2)x{(l-5i)il-p,) + Sipi) if (m, V) 6 Ti 

(l-52)x{(l-6i)p, + Si(l-p,)) if (M,v)er2 

<52 X ((1 -^i)(l -pi) + Sip,) if (u,v) e T, 

Si X ((1 -^i)ft + ^i(l -p,)) if («,v) e 



The sets T\,T2,TT„Ti, denote the different types of transition that 
can occur. The set Ti={(A,A),(B,E),(C,C),{D,D)} contains the 



transitions where no change in gene flow occurs. The set 
T2 = {(A,B),(B,A),(C,D),(D,C)} are the transitions with a change 
only to which of the parent's haplotypes are IBD. The set 
T3 = {(A,C)XB,D),(C,AX(D,B)} are the transitions with change 
only to which of the child's haplotypes are IBD. The set 
T4 = {(A,D),(B,C),{C,B),(B,A)} are the transitions with a change 
to both of child and parental IBD haplotypes. Figure 1 shows 
examples of how true recombination events and SEs in parental or 
child haplotypes lead to changes in the inheritance pattern in 
terms of T2, T3 and Tn events. 

We accommodate genotyping error by allowing for errors in the 
emission part of the HMM. We model the observed haplotypes at 
the /th locus conditional upon the inheritance state Si as follows 



P(j},,ci\Si = (j,k))-- 



l-£ \ipii=Clk 



e if pij it cik 
Our full model can then be written down as 

p{p,c,s)=p(Si)''n p{s,+ 1 \Si) h p(p,,c,\Si)- 

This method can be applied to any set of estimated haplotypes 
from parent-child pairs. We run one iteration of the Fomard- 
Backward algorithm [36] to estimate £, ^1 and S2. Since there is 
littie uncertainty in the state path this was found to be adequate for 
convergence. This estimation is carried out on each duo 
separately. Since the HMM has just four states the computation 
involved is negligible. 

We applied the duo HMM to the Beagle and SHAPEIT2 
haplotypes from each cohort and examined the Viterbi paths for 
each duo. The Viterbi path is the most likely underlying state 
sequence, given the observed data [36]. We split duos according to 
the sex of the parent as we know that the rate of recombination 
events is higher in females than in males [37]. 

Correcting haplotypes. We now describe how to use our 
model to adjust haplotypes so that they are consistent with a given 
pedigree structure. After estimating parameters, we run the Viterbi 
algorithm to find the most likely state sequence. There are sixteen 
possible state transitions in our model. The eight transitions in the 
sets and T4 imply a SE in the child haplotypes, so when we 
observe one of these transitions in the Viterbi sequence we infer a 
SE in the child. The eight transitions in the sets T2 and imply 
either a SE or a recombination event in the parental haplotypes. 
Inferring whether a recombination or a SE has occurred in the 
parental haplotypes is difficult. When more than one sibling is 
present in a pedigree, we can correct probable parental SEs via 
identifying the minimum recombinant haplotypes for the family. 
When one of the T2 or transitions is present in the same 
location for the majority of siblings, this is most likely a SE on the 
parental haplotypes and they can be corrected accordingly (see 
Figure S5 for an example of this process). This is not strictly the 
maximum-likelihood solution, but the minimum-recombinant and 
maximum likelihood solutions often yield the same result [38]. 
When we infer a SE in either a parent or a child we correct the 
haplotypes by switching the haplotype phase of all loci proceeding 
the SE. This procedure is carried out left to right along the 
sequence. 

Corrections are apphed sequentially 'down' through each 
pedigree. For example, in a three generation (grandparent- 
parent-child) pedigree we first apply the method to those duos 
containing grandparents. Any corrections made to the parents 
haplotypes are used when processing duos involving those parents 



PLOS Genetics | www.plosgenetics.org 



4 



April 2014 I Volume 10 | Issue 4 | el 004234 



Haplotype Phasing in Related Samples 



True haplotypes inferred correctly and true 
recombination event inferred as Tj transition 

Pi '^OT 1 0 1 O'O'0 1 lllllOllllllO 11110 llT^ 

P2 011011111111111111011111111101 
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C2 111111011111111110101111101010 



Switch error in the child's haplotypes that causes 
a T3 transition 
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P2 011011111111111111011111111101 



Switch error in the parent's haplotypes that 
causes a Tj transition 
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Switch error in the parent's haplotypes at true 
recombination event. 



T2 T2 



Switch error in the parent's and child's 
haplotypes that causes a T4 transition 



Pi 011010001111110111111011111101 
P2 011011111111111111011111110111 

ci '^TTf/m^^^^^^^^TT^i 111101 
C2 111111011111111110101111101010 





11111111101 


0110111111111111110 


11011110111 



Ci 011010 0 01111110111101111101010 
C2 1111110111111111101 .110 11111101 



Figure 1. Examples of inferred haplotypes with true recombination events and SEs. In each examples p\, pi, c\ and C2 denotes the two 
parental and child haplotypes and S denotes the pattern of gene flow. Top: Correctly inferred haplotypes in a region of a true recombination event 
that causes a Ti transition in the duo HMM. The other 4 examples in the figure add SEs to these true parental and child haplotypes. Middle left: 
addition of a SE in the child's haplotypes that causes a transition. Middle right: addition of a SE in the parent's haplotypes that causes a T2 
transition. Bottom left: addition of a SE in the parent's haplotypes at the site of the recombination event that causes the T2 transition to be missed. 
Bottom right: addition of a SE in both the child's and parent's haplotypes at the same position that causes a Ta, transition. 
doi:1 0.1 371/journal.pgen.1 004234.g001 



and their children. This removes any (detectable) SEs for 
individuals that have parents, before their descendants are phased. 
We applied our method of correcting haplotypes to all of the 
chromosome 10 haplotypes produced by SHAPEIT2, Beagle and 
HAPI-UR in all cohorts and evaluated the improvements in 
accuracy as well as the number of corrections that were required. 

Detecting recombination events. Once aU the haplotypes 
have been corrected the duoHMM is re-run in order to infer 
recombination events. We do this by calculating the probability of 
a recombination event between markers. A transition between the 
parental haplotypes corresponds to either a SE or a genuine 
recombination. A recombination event can only be resolved down 
to the region between its two flanking heterozygous markers in the 
parent. We use Rij+m to be the indicator variable of a 
recombination event between heterozygous markers / and l + m. 
We evaluate the posterior probability of such a recombination 
event as 



P{Ru+,„ = l^,c)<x P(Si = u,Si+„ 

(",v) 



v,R. 



IJ+m '- 



where 

P(S/ = M,S/+,„ = v,Rij+i„ = 1 ,p,c) = P(px„„j,ci i,Si = u) 

xP(S,+,„ = v,Rij+„, = l\Si = u) 

The first and last probabilities can be calculated from the forward- 
backward algorithm, and the transition rates that include a 
recombination event are as follows 



P(Si+i„ = v,Rij^ 



-l\S, = u)-- 



(l-S2)SiPi if{u,v)eTi 

(l_^2)(l-<5i)P/ if(w,v)er2 

SjSip, if(u,v)eTi 

(52(1— i5i)p; if (m,v) e 74 



Note since the loci between / and l + m are homozygous the 
emission probability is the same regardless of state, hence we do 
not require this term in the calculation. A recombination event is 
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inferred when 

P{Rl,l+„ = l\p,c)>t 

for some threshold t e (0,1). In the analysis in this paper we have 
used / = 0.5. To calculate these probabilities we tried using 
SHAPEIT2's most likely (pedigree corrected) haplotypes, or 
sampling haplotypes from SHAPEIT2's diploid graph (again 
correcting for pedigree structure) and repeatedly calculating 
recombination probabilities and averaging the resulting maps. 
We find that both these methods produce similar results (data not 
shown) and we only report results from the sampling based 
approach. 

We applied this method separately to all the individuals in each 
of the eight cohorts. We then compared the regions where our 
HMM detected a recombination event to the recombination 
locations found in Merlin's output (the Viterbi solution to the 
Lander-Green algorithm). To reliably detect recombination 
events, a Lander-Green implementation requires either a nuclear 
family with at least tlirc'c children or a three generation pc'digree 
[39,40]. Hence we only evaluate meioses that meet these criteria in 
the real data sets, referring to these as informative meioses. 

Detecting genotyping error. We can also use this model to 
detect genotyping error at locus /, by summing over the posterior 
probabilities of inheritance states that have inconsistent haplo- 
types. We use the indicator variable £/ 6 {0,1} to denote the 
absence /presence of a genotyping error at locus / in a duo. Then 
we have 

2 2 

P{Ei = 1 \p,c) =J2Y1 ^^J' * ^»)P(Si = {j,k)\p,c) 

7 = 1 k=\ 

where P{Si = {j,k)\p,c) is the posterior probability of inheritance 
pattern (j,k) and can be efficiently calculated from the HMM 
model. This is the probability of a genotyping error occurring in at 
least one member of the duo given the observed haplotypes. We 
show on simulated and real data that masking genot)'pes with 
P(Ei=\\p,c)>0.9 yields a drop in switch error rates suggesting 
that this is an effective error detection method. 

Using detected recombinations for association scans of 
hotspot usage. To demonstrate the utility of our recombination 
detection method we conducted association testing between 
genetic variants in the PRDM9 region (chr5:23007723- 
24028706) and the "hotspot usage" phenotype described in Coop 
et al. (2008) [39]. Substantial association in this region was also 
found in Kong et al. (2010) [12] and Hinch (2011) [41]. We 
calculated the same phenotype as Coop et al. (2008), the 
proportion of crossover events, a,-, that occur in a recombination 
hotspot for individual / (tlie parent). This \ alue was corrected for 
the probability that events occur in one of these hotspot regions by 
chance via simulation. 

The accuracy with which a, is measured increases with the 
number of crossovers observed for that pannit, hence parents with 
more obserx^ed crossovers should be given higher weighting (large 
nuclear famihes are advantageous in this situation). We weighted 
individuals by creating pseudo-counts of hotspot events A:, = «,- X a,- 
where n,- is the number of crossover events observed for parent /. 
We then fit a standard Binomial Generalised Linear Model (GLM) 
with (A:,,M,) as the response and the genetic dosage at each SNP as 
the covariate. We then performed a likelihood ratio test between 
this model of association and the 'nuU' model where no genetic 



variant is included. Variants were imputed from the 1000 
Genomes March 2012 reference panel and filtered such that all 
variants had MAF>0.01 and INFO > 0.4 in all cohorts. 

The use of the Binomial GLM allows us to leverage parents who 
are part of typically uninformative meioses, where it is unlikely the 
majority of crossover events were detected. Such individuals are 
simply down weighted in our association testing. 

Simulation study 

In our real data experiments we use haplotypes inferred by 
Merlin as the 'true' haplotypes for our methods comparison. In the 
Results section we show that SHAPEIT2 phases extended 
pedigrees with close to perfect concordance with the haplotypes 
produced by Merlin (typically <0.1% average SE). This k-vel of 
discordance is of a similar order to both the number of 
recombination events, and genotyping error [42] which the 
Lander-Green algorithm is known to be sensitive to. Whilst we 
have applied standard quality control procedures (including 
Merlin's error checking) to these data, genotyping errors are likely 
to stiU be present. Hence at least some of this discordance may be 
in fact due to errors in Merlin haplotypes. We also compare the 
recombination events detected by MerUn in extended families to 
those detected by our duoHMM approach. Any discordance 
between these crossover callsets may also be due (in part) to Merlin 
errors. We also wanted to investigate the ability of our method to 
call crossover events in duos and trios which cannot be done with 
the Lander-Green algorithm. For these reasons we created several 
simulated datasets to investigate these issues. 

We utilised male chromosome X haplotypes as the basis for 
these simulated datasets. Since males only have one copy of 
chromosome X, phase is unambiguously known. As in previous 
phasing studies [2,43,44], two male X chromosomes were 
combined to create a pseudo autosomal diploid founder individual 
where the true underlying haplotypes are known. We then 
randomly mate these new diploid individuals to produce offspring 
with recombined haplotypes. Crossover e\'ents were simulated as a 
Poisson process on the genetic lengths from the HapMap 
Chromosome X genetic map for females and the same map 
scaled by 0.605 for males (difference in rates estimated from 2002 
deCODE Map). 

In all experiments, we applied a simple rejection sampling 
scheme to avoid large amounts of consanguinity in our new diploid 
individuals and their offspring. The X chromosomes used to create 
pedigree founders were sampled such that no pair of chromosomes 
came from pairs of males with genome wide relatedness r>0.01 
[45]. We conducted these experiments using the 107 1 (607 females 
and 464 males) nominally unrelated individuals from the Val 
Borbera cohort. This allowed us to create up 232 to diploid 
individuals with known haplotypes. 

A simulated dataset with extended pedigrees. We wished 
to investigate accuracy on pedigrees that are collected as part of a 
larger cohort, hence we simulated pedigrees with the same 
structures as those observed in the Val Borbera cohort. We only 
used those pedigrees having "informative" meioses (> 3 siblings or 
three generations) and generated founder individuals using male X 
chromosome data. These simulated founders were then "mated" 
to create descendants (and the descendants were also mated in 
cases of three generation pedigrees). There were 65 such 
pedigrees, with 199 founders although only 108 of these founders 
were assayed. We carried out two sets of simulations: one realistic 
scenario where we attempt to emulate the type of data collected in 
practice and one ideal scenario which should be advantageous for 
Merlin. 
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In the realistic scenario we simulated genotyping errors based 
on a confusion matrix (Table S2) from a previous study [42], that 
was estimated by comparing genotypes called on both an 
AfFymetrix Axiom chip and a lUumina Omni 2.5S chip on 1000 
Genomes individuals. There were many cases where founders 
were missing in practice. We also removed the 91 ungenotyped 
founders leaving 108 genotyped founders and a total of 314 
individuals in pedigrees. Finally these extended pedigrees were 
merged with the 607 female chromosome X data (and the 
remaining 33 pseudo-autosomal males not in a pedigree) to create 
a cohort of 9.54 individuals containing 3 1 4 individuals in pedigrees 
and 640 unrelated individuals. We created 10 versions of this 
simulated dataset. In the ideal scenario we do not add any 
genotyping error and do not remove founders (leaving all 199 
founders in the data giving us a data set with 1045 individuals (405 
within a pedigree). 

We use the realistic dataset to compare three different versions 
of our method for phasing extended pedigrees. First we applied 
SHAPEIT2 ignoring all pedigree information. Secondly, we 
applied our duoHMM method to this set of haplotypes to correct 
SEs. We also used the duoHMM output to identify positions 
where there was strong evidence of genotyping error, and we 
investigate the effect of excluding these sites for accuracy 
comparisons. Finally, we applied our method of partitioning the 
extended pedigrees into trios and duos and then ran Beagle using 
this level of famUy information. We also evaluate Merlin's 
haplotype accuracy on this simulated data. 

We also ran Merlin and our duoHMM method on these 
datasets to detect recombination events on all informative meioses 
which allowed us to investigate the sensitivity and specificity of the 
methods in both a realistic and ideal scenario. In the reahstic and 
ideal simulations, there were 183 and 212 informative meioses 
containing a total of 2422 and 3131 crossover events (across all 
simulations) on which to evaluate accuracy. 

A simulated dataset of uninformative duos. The ability to 
detect recombination events is enhanced if an individual is part of 
a large pedigree. Such pedigrees will contain parent-child 
relationships where the parent's heterozygous sites can be phased 
independendy of that child's loci (such as from a grandparent or 
another child) this allows cliangc-s in the pattern of inheritance 
(recombination events) to be detected. Previous work on detecting 
recombination events in pedigrees have used such informative 
pedigrees [37,39,41,46]. We wanted to assess the power of our 
method to detect recombination events in uninformative duos. 

We created a simulated dataset of 1 1 6 mother-father-child trios 
using the 464 chromosome X haplotypes from the Val Borbera 
cohort. These trios were merged with the diploid female 
chromosome X data to increase sample size. 

We created a second version of this simulated dataset by first 
removing individuals from the Val Borbera dataset such that no 
pair of individuals had a genome wide relatedness ;'>0.35 in the 
remaining data. This reduced the dataset from 1071 to 778 
individuals (440 females and 338 males) allowing us to simulate a 
dataset with 84 mother-father-child trios, merged together with the 
original 440 females. By removing closely related individuals from 
the cohort we create a simulated dataset more like a non-isolated 
population. Detection of recombination events will be harder in 
this setting due to reduced levels of IBD sharing. 

The merging, mating and recombination was simulated ten 
times (each simulation analysed separately) creating 2270 and 
3190 recombination events to evaluate sensitivity and specificity of 
our method. We then attempted to detect the recombination 
events using the pipeline previously described in the Methods 
section. 



Results 

Levels of relatedness within each cohort 

Figure 2 (left) shows the proportion of heterozygote sites phased 
by SLRP, which we refer to as the yield. SLRP's yield ranged from 
31.82% for the Split cohort to 88.15% for the ORCADES cohort. 
Split and GPC were the only cohorts with less than 60% yield 
demonstrating low levels of IBD sharing between individuals in 
these cohorts. Following Palin et al. [13] we also examined 
individuals who were not "closely" related by excluding all 
individuals with a realised relatedness [45] of r>0.125. We found 
the yield was substantially lower in the CARL and FVG cohorts 
demonstrating some of the IBD sharing present was between 
closely related individuals rather than distant cousins in these 
cohorts. AH other cohorts did not exhibit as large a drop in yield 
after removing closely related individuals, highlighting the large 
amounts of IBD sharing between more distantly related individ- 
uals in these cohorts. 

Similar to Kong et al. (2008) [12], we took each individual in 
turn and at each locus we calculated the number of other 
individuals that share an IBD segment (excluding closely related 
individuals with relatedness r>0.125). We then took the average 
across all loci on chromosome 10 for each individual and plotted 
the distribution of this average IBD sharing in Figure 2 (right). The 
average IBD sharing is a function of both the sample size and the 
amount of relatedness between individuals in the population. Spht 
again clearly has very small amounts of IBD sharing whilst the 
other cohorts have broadly similar distributions. It is notable that 
all cohorts have some individuals with < 1 surrogate parent on 
average while some individuals have > 10 surrogate parents. 

Haplotype accuracy in founder individuals 

Table 1 shows the SE rates for SHAPEIT2, SLRP, Beagle and 
HAPI-UR when run on the founder individuals of each cohort, 
bodi within and outside SLRP IBD segments. SHAPEIT2 
consistently produced the most accurate haplotypes of all methods 
within IBD regions. SHAPEIT2 had a mean SE rate of between 
0.14% (ORCADES) and 0.75% (CARL), the next closest method 
was SLRP with SE rates between 0.2S% (GPC) and 1.99'/o (Split), 
followed by Beagle with SE rates bet\\<-en ().3()"() iGPC) and 
4.38% (Split). HAPI-UR had high SE rates ranging between 
0.35% (GPC) and 8.30% (Split). The GPC cohort stands out here, 
as all methods perform very accurately (<0.4% SE) which can be 
explained by the larger sample size of this cohort (2,676 
individuals in total) and the much denser chip (lUumina Human 
OMNI2.5S) used to genotype this cohort. It is interesting that 
SHAPEIT2 seems to have the highest SE rates on the CARL 
cohort, which has only the second lowest level of relatedness 
between founders (Figure 2 (right)). Figure S6 (left) shows the 
SLRP IBD SE rate against the SHAPEIT2 rate for each indiv idual 
in the Val Borbera cohort. This highlights that both methods 
product; SE rates close to zero on many individuals but 
SHAPEIT2 is generally more accurate. 

When calculating SE rate across the whole of chromosome 10 
(not just in IBD regions), SHAPEIT2 also has the lowest error rate, 
ranging from 0.28% (GPC) to 2.65% (Split cohort) as opposed to 
0.49% and 5.57% for Beagle and 0.50% and 11.10% for HAPI- 
UR. Switch error rates for SLRP cannot be evaluated across the 
whole of chromosome 10 due to the method only producing 
partially phased haplotypes. We observe that all methods perform 
relatively better within IBD regions than across the whole 
chromosome. However, the difference for SHAPEIT2 is much 
larger than for Beagle and HAPI-UR. These results suggest that 
whilst none of these methods explicitly model IBD sharing, its 
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Figure 2. Summary of IBD sharing in cohorts. Left: The proportion of heterozygote sites phased by SLRP for all individuals (pink) and when 
individuals with close relatives (c > 0.125) are removed (blue). Right: The distributions of the average number of "surrogate" parents for each cohort 
when closely related pairs (i->0.125) are removed. 
doi:1 0.1 371/journal.pgen.1 004234.g002 



presence tends to be exploited implicitly, and particularly so in 
SHAPEIT2. Figure S6 (right) plots the SHAPEIT2 SE rate within 
IBD regions (detected by SLRP) against the rate outside these 
regions. Switch error is clearly close to zero when IBD sharing is 
present and has a rate more comparable to non-isolated 
populations when no IBD sharing is present. 

Haplotype accuracy in extended pedigrees 

Treating all individuals as unrelated. Table 2 shows the 
SE rates for SHAPEIT2, SLRP, Beagle and HAPI-UR when run 
on all individuals in each cohort and ignoring any of the 
information about explicit family relationships between individu- 
als. We calculated SE for each method on the haplotypes of any 
individual within an extended pedigree more complex than a 
simple duo or trio, using the Merlin haplotypes as truth. 

All of the methods have lower SE rates than when run on just 
the founders of the pedigrees (Table 1) but the improvement for 
SHAPEIT2 is most striking. We find that SHAPEIT2 achieves 
between 0.059% (Korcula) and 0.241% (CARL) SE rate on 
individuals within pedigrees. SLRP achieves a very high yield in 
most cohorts (>90% except in Split and GPC) for individuals 
within pedigrees and improved accuracy within the IBD regions it 
detects. Both Beagle and HAPI-UR (3x) are also more accurate 
in this setting as well but do not obtain the same gains as 
SHAPEIT2. 

duoHMM corrected haplotypes. We applied our 
duoHMM method to the SHAPEIT2, Beagle and HAPI-UR 
haplotypes that were estimated ignoring all family information. To 
give a sense of the output of applying this model Figure 3 shows 
the Viterbi paths of 50 male parent duos from the Val Borbera 
cohort for both SHAPEIT2 and Beagle. The four possible IBD 
states (A,B,C,D) are shown using colours pale blue, dark blue, 
light red and dark red respectively. Changes between a blue and 
red colour correspond to a T3 or T4 transition, both of which 
imply a SE in the child. Changes of colour between light and dark 
blue or between light and dark red correspond to T2 transitions, 
which correspond to a change on IBD state in the parent, and 
could be caused by a recombination or a SE in the parent. Figure 1 
provides examples that can be helpful in interpreting Figure 3. 
Figures S7, S8, S9, SIO, Sll, S12, S13, S14, S15, S16, S17, S18, 
S19, S20, S21, S22 show the Viterbi paths of male parent and 
female parent duos for all of the cohorts. 



Figure 3 shows a striking difference between the output on the 
SHAPEIT2 and Beagle haplotypes. The SHAPEIT2 paths have 
very few transitions of all types, and when transitions occur they 
are predominantly T2 transitions. The figure shows only father- 
child duos and chromosome 10 has been estimated to have a 
genetic length of 1.34 Morgans for paternal meioses [37]. The 
numbers of T2 transitions in the 50 duos in Figure 3 looks 
reasonably consistent with this genetic length, suggesting that the 
T2 transitions are indeed true recombination events. We note that 
there are some duos with no transitions. This is a possible outcome 
of a meiosis and is more likely to occur on the shorter 
chromosomes, and can also be the product of undetected 
recombination events. 

The Beagle haplotypes contain many more T2, and T4 
transitions. In the Val Borbera cohort when we compared the 
SHAPEIT2 and Beagle haplotypes to those estimated by Merlin 
we found that SHAPEIT2 produced 4,613 SEs in 1,074 
individuals corresponding to 4.3 switches per individual, whereas 
Beagle produced 29,681 switches or 27.6 switches per individual. 
These numbers seem consistent with what we observe in Figure 3. 
The higher rate of SEs in the Beagle haplotypes cause a large 
number of changes in estimated IBD state. 

Table 3 shows the mean number of state transitions in paternal 
and maternal duos for each cohort for SHAPEIT2 and Beagle. 
Note that and T4 transitions are biologically impossible as they 
represent a change in which child haplotype the parent is 
transmitting genetic material to. The SHAPEIT2 haplotypes 
typically have <0.5 of these transitions occurring per duo whilst 
Beagle ranges from 7.49 to 97.17 for transitions. 

Transitions T2 and T4 may correspond to crossover events or 
SEs in the parental haplotypes. The 2002 deCODE map gives the 
chromosome 10 genetic length as 1.34 and 2.18 Morgans for 
males and females respectively [37]. The mean numbers of T2 or 
T4 transitions in the SHAPEIT2 show rough agreement with the 
expected number of recombinations on chromosome 1 0 in most 
cohorts. For example, in the VIS cohort we observe 
72 + 74 = 1-32 and 72 + 74 = 2.09 transitions in males and 
females respectively. The female rates in CARL are higher than 
we might expect at 2.8. Both the male and female rates are lower 
than we might expect in the Korcula and Split cohort. This is 
likely due to insufficient information in the data to infer the true 
parental haplotype and hence we are seeing a parental SE at the 
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Figure 3. The duo HMM Viterbi paths for 50 father-child duos from the Val Borbera cohort on chromosome 10. The four possible IBD 
states (A, B, C, D) are shown using colours pale blue, dark blue, light red and dark red respectively. The left and right panels show the results of the 
duo HMM applied to the SHAPEIT2 and Beagle haplotypes respectively. Changes between a blue and red colour correspond to a or T4 transition, 
both of which imply a SE in the child. Changes of colour between light and dark blue or between light and dark red correspond to T2 transitions, 
which correspond to a change on IBD state in the parent, and could be caused by a recombination or a SE in the parent. The x-axis shows the sex- 
averaged genetic distance across the chromosome in centiMorgans. 
doi:1 0.1 371 /journal.pgen.1 004234.g003 



recombination event, that is, we are inferring the transmitted 
haplotypes. 

Figure S5 shows the results of applying our duo HMM on the 
SHAPEIT2 haplotypes of all father-child duos in a three sibling 
family before and after using our correction method. Before 
correction (Figure S5 (left)), the first father-child duo exhibits no 
evidence of recombination (change in colour between dark and light 
red) and these two individuals share a whole haplotype across the 
whole chromosome. The other father-child duos show evidence of 3 
and 4 recombination events respectively, with one event being 
shared in common at « 25 Mb. Our method infers a parental SE at 
this location. This has the effect (Figxire S5 (right)) of reducing the 
total number of inferred recombination events across the 3 duos 
from 0, 3 and 4 to 1,2 and 3, which seems more realistic. 

Table 2 shows the mean SE rate after applying haplotype 
corrections. The corrections lead to a consistent but small 
improvement for the SHAPEIT2 haplotypes, the largest improve- 
ment being a 0.009% decrease in SE for the CARL cohort. These 
results further highlight the very high quality haplotype estimates 
that SHAPEIT2 produces despite the fact it is ignoring all explicit 
pedigree information. 



We also find the duoHMM method is beneficial to Beagle and 
HAPI-UR when used to correct those haplotypes. For example the 
SE rate for HAPTUR drops from 7.722% to 6.193% for the Split 
cohort. Table S3 gives the average number and type of correction 
applied to each cohort for each method. SHAPEIT2's haplotypes 
require less than 0.5 corrections on average for chromosome 10, 
this is consistent with the very small improvement in SE. 

Partitioning pedigrees into Duos/Trios. Table 2 also 
shows the haplotype accuracy for pedigrees that were partitioned 
into duos and trios and then phased accordingly using Beagle 
(denoted as Beagle Duo/Trio). Not surprisingly, adding the duo/ 
trio relationships yields a substantial improvement to the Beagle 
haplotypes across all cohorts; for example we see a drop from 
1.362% SE to 0.445% SE in CARL. However they are still 
consistently less accurate than SHAPEIT2's haplotypes (even 
though SHAPEIT2 is not using any relationships). 

When using the Beagle Duo/Trio mediod some individuals will 
not be phased as part of a duo or trio, for example one of the 
children in a three sibling nuclear family. Figure 4 (top and centre 
left) plots the SE for such "unrelated" individuals for Beagle Duo/ 
Trio versus SHAPEIT2 and Beagle Duo/Trio versus SHAPEIT2+ 
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duoHMM respectively. These plots show that these individuals are 
phased much more accurately using our methods. Duos and trios 



I are phased almost equally well by all methods. 



We used our SHAPEIT2+duoHMM method to flag sites of 
possible genotyping error and then re-calculated the SE rates for 
SHAPEIT2+duoHMM and the Beagle Duo/Trio method ex- 
cluding these sites. The results are shown in the last two rows of 
Table 2. Comparing these results to the unmasked results we see 



Cl. CJi 
— Q. 
E £ 

S lu that the reduction in SE is greatest for SHAPEIT2+duoHMM, 



suggesting that genotyping error causes the results of Beagle Duo/ 
Trio to appear better than they are. The results from our 
g simulation study (described next) corroborate this point. Figure 4 

0 al (right) shows in more detail the effect of masking genotyping 

1 ^ errors, and that SHAPEIT2+duoHMM outperforms Beagle Duo/ 
J "g" Trio for individuals that were phased as 'unrelated' by Beagle. 

g 1^ Results on the simulated dataset with extended 

^ 5 pedigrees. The simulated pedigree data allows us to evaluate 

S ° the confounding effect of genotype errors in the Merlin and duo/ 

Q. I trio phased haplotypes. 

I § Figure S23 (top right) plots the SE of SHAPEIT2+duoHMM 

§ I versus Merlin on simulated data with realistic levels of genotyping 

's -5 error. SHAPEIT2+duoHMM is generally more accurate (average 

I ° of 0.033% versus 0.215% for Merlin on sites resolved by Merlin). 

I g' Without any genotyping error (Figure S23 bottom right) the 

performance is much improved for Merlin (SHAPEIT2+ 
duoHMM SE = 0.005%, Merlin = 0.021%). 

Figure S24 plots the SE rates on the simulated pedigrees for 
Beagle Duo/Trio SE rates of all individuals versus those of 
SHAPEIT2 (left) and SHAPEIT2+duoHMM (centre). The plot 
shows that the most accurate haplotypes are attained by the 
SHAPEIT2+duoHMM approach, and that the duo/trio con- 
strained phasing can be susceptible to genotyping error. The SE 
rates of SHAPEIT2, SHAPEIT2+duoHMM and Beagle Duo/ 
Trio were 0.104%, 0.065% and 0.269% respectively. When we 
Si + removed sites flagged as genotyping errors by our method the SE 

« rates were 0.073%, 0.034% and 0.231% respectively, suggesting 

£ I that the masking can remove switch errors caused by genotyping 

1 I errors. 

■§, "I Overall, these results suggest that at least some of the observed 

2 § discordance in the analysis of real data sets is due to errors in the 
^ a; Merlin haplotypes caused by genotyping error. The true error 
§. ^ levels are likely to be lower but the masking can help to remove the 
^ ^ ^ confounding effect. 
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Figure 4. Switch error rates for individuals in extended pedigrees for different phasing pipelines across all European cohorts 
(chromosome 10). Points are coloured according to what relationship was used by Beagle to phase that individual (red meaning no relationships 
were used). Left: Beagle using duo/trio phasing versus SHAPEIT2 using no relationships. Centre: Beagle using duo/trio phasing versus SHAPEIT2+ 
duoHMIVI using no relationships. Right: Beagle using duo/trio phasing versus SHAPEIT2+duoHMIVI using no relationships when masking loci flagged 
as probable genotyping errors by the duoHIVIM. Switch error is reduced for both methods suggesting the masking is sensible. 
doi:1 0.1 371 /journal.pgen.1 004234.g004 



we can achieve a FDR = 3.78% and TPR = 92.4% or 
P(i^=l)>0.9 for FDR=0.58% and TPR = 69.45%. In the 
absence of genotyping error (Figure S23 bottom left) the 
performance of Merlin is improved (FDR =3.48% and 
TPR = 95.66%) but SHAPEIT2 is marginaUy better 
(FDR = 0.71% and TPR = 95.75% at P(i^= 1)>0.5). 

Figure 5 compares tlie gene flow (and hence recombination) 
inferred by MerUn and tlie recombination probabilities inferred by 
our method for 1 0 informative meioses between parent-child duos 
from the Val Borbera cohort on chromosome 10. This figure 
shows very good agreement between our estimated probabilities 
and the events inferred by Merlin. However, these examples 
highlight that Merlin does infer some rather implausible, sporadic 
events in some meioses (even after running Merlin's error 
detection). 

Table 4 reports the percentage of Merlin recombinations that 
fall within a recombination region found by our method, and the 
percentage of our recombination regions that contain a Merlin 
recombination event. Percentages are given for each cohort 
separately. These results show only a rough concordance between 
Merlin and our method, with between 4 1 % and 6 1 % of Merlin 
recombinations detected by SHAPEIT2 and 80% to 89% of 
SHAPEIT2's recombination events concordant with Merlin. This 



table also shows the mean number of recombination events found 
by each method in maternal and paternal meioses. For compar- 
ison, the frequentiy cited deCODE 2002 genetic map estimated an 
average of 25.9 and 42.81 autosomal recombinations per paternal 
and maternal meioses respectively. The average number of 
recombinations for Merlin was substantially inflated across most 
cohorts (53 to 105 for maternal and 31 to 79 for paternal events) 
whilst SHAPEIT2's were in a more reasonable range (25 to 29 for 
paternal events and 41 to 47 for maternal events). Figure S25 plots 
the number of events found per meiosis by SHAPEIT2+duoHMM 
versus Merlin; there is obvious correlation but Merlin is typically 
reporting a much larger number of events than SHAPEIT2+ 
duoHMM. 

Figure 6 compares the distribution of the number of recombi- 
nation events found by Merlin and our method to what we would 
expect according to the 2002 deCODE family based map. Figure 6 
(top) plots the observed against expected number of recombina- 
tions in paternal and maternal meioses for each chromosome. The 
results from SHAPEIT2 are well calibrated against the expecta- 
tion, whereas the Merlin haplotypes exhibit elevated levels of 
recombination. Figure 6 (bottom) shows Q_Q;plots comparing the 
observed and expected number of genome-wide recombinations in 
paternal and maternal meioses for each duo when using a Poisson 
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50 100 150 

Chromosome 1 0 genetic length (cM) 

Figure 5. Inferred gene flow by Merlin (purple) and our method 
(grey) for ten informative meioses on chromosome 10 taken 
from Val Borbera cohort pedigrees. The light and dark purple 
represent genetic material from the grand-paternal and grand-maternal 
chromosomes (as inferred by Merlin's Viterbi algorithm), hence 
changing from light to dark implies a a recombination event. The grey 
rectangles contain the posterior probability (in black) of recombination 
from our method. The two methods broadly agree, although Merlin has 
inferred a number of implausibly small cross over events. 
doi:1 0.1 371/journal.pgen.1 004234.g005 

distribution witli rates 25.9 (paternal) and 42.81 (maternal) as our 
expected distribution. SHAPEIT2 rates are well calibrated against 
the expectation for paternal meioses, with some over-dispersion 
present in maternal meioses compared to the Poisson model. 
Figures S26, S27, S28, S29, S30, S31, S32 show these Q.Q;plots 
for each of the other cohorts separately. 

Figure S33 shows the average fine-scale recombination rate as a 
function of distance from the inferred recombination events 
inferred by both SHAPEIT2 and Merlin (black), only SHAPEIT2 
(red) and only Merlin (blue). The distribution of recombination 
rates around the SHAPEIT2-only crossovers is close to the 
distribution of crossovers found by both methods, whilst the 
recombination rates near Merlin events are on average lower. 

These results on both simulated and real data point to elevated 
false discovery rates for recombination detection with Merlin, 
corroborating what has previously been reported in the literature 
[39] whereas the SHAPEIT2-HduoHMM method can constrain 
FDR whUst stiU detecting a substantial proportion of true crossover 
events. 

Uninformative duos. Figure 7 (left) shows ROC curves for 
detecting recombination events apphed to our simulated uninfor- 
mative meioses. We found that recombination events could be 
detected with low false discovery rates, but the power of the 
method to detect recombination events is clearly limited by the 
demography of the sample. When closely related individuals were 
not removed we see that 53.51% of events could be detected with 
a false discovery rate of 5%, but when closely related individuals 
were filtered we could only detect 34.10% of events with 5% false 



discovery rate (posterior probability threshold of 0.7). Importantly, 
the posterior probabilities of a recombination event appear 
roughly calibrated (Figure 7 right) so by setting a high probability 
threshold, researchers can be confident they are detecting true 
events with our method. 

Using detected recombinations for association scans of 
hotspot usage. Figure S34 (left) shows the signal of association 
in the PRDM9 region for a meta analysis of aU the European 
cohorts (GPC excluded). When only the 618 informative parents 
were used (top) we found a minimum P-value of 6.25 x 10^" at 
SNP rs2 162866. The addition of 466 individuals lead to a modest 
increase in signal (P-value = 3.31 x 10^'^) at the same SNP, 
Figure S34 (right) plots the — logigP-values when all individuals 
are used against the — logigP-values when only informative 
individuals are used, demonstrating a consistent increase in signal 
in the region. This suggests that our method is indeed detecting 
true recombination events. 

Discussion 

Long range phasing has been a topic of interest since its 
inception by Kong (2008) [12] and has great potential for the 
analysis of genomic data, particularly as cohort sizes increase and 
hence more IBD sharing becomes present between individuals. 
Whilst the deCODE project has generated some excellent results, 
they have the advantage of an extremely powerful data set 
containing substantial amounts of IBD sharing which allows a rule 
based approach to long range phasing that yields very accurate 
haplotypes. This is not a luxury available to many research groups. 
We demonstrate that SHAPEIT2 implicitiy performs this very 
accurate long range phasing when possible, whilst stiU leveraging 
LD when it is not. 

Using eight cohorts from isolated and non-isolated populations, 
all containing explicitly related individuals, we have carried out a 
comprehensive evaluation of approaches for haplotype estimation 
in the presence of IBD sharing. We compared approaches that are 
specifically focused on estimation of haplotypes in isolated samples 
(SLRP) and others (SHAPEIT2, Beagle and HAPI-UR) that were 
designed predominantiy for cohorts of nominally unrelated 
individuals. Our experiments show that the SHAPEIT2 method 
provides high quality haplotypes that are more accurate than those 
estimated by SLRP, whereas Beagle and HAPI-UR produce 
results that are worse than SLRP. We find that the SE rates of 
SHAPEIT2 are a fraction of a percent in all cohorts, whereas the 
approaches BEAGLE and HAPI-UR produce SEs that are an 
order of magnitude larger. 

A big disadvantage of existing pedigree analysis software is the 
inability to leverage wider cohort information to resolve sites that 
are heterozygous throughout a particular pedigree. Hence there is 
a need for software that can fully leverage the relatedness within 
pedigrees for accurate phase whilst overcoming the limitations of 
traditional pedigree analysis software. We propose a two-stage 
approach in which SHAPEIT2 is first run ignoring all explicit 
family information. We then apply the duoHMM method to 
incorporate the pedigree information in a cohort to further 
increase the accuracy of haplotypes inferred. The duoHMM 
method infers the inheritance pattern in parent-child duos, detects 
genotyping errors and can correct switch errors. 

We have found that the resulting haplotypes from our 
method are so accurate that we can infer recombination events 
in parent-child duos. We use the output of our duoHMM to 
estimate the probability that a recombination event occurs 
between each pair of heterozygous markers. When applied to 
all eight cohorts across whole chromosomes we find that the 
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S g number of recombination events inferred by our metliod sliows 

^ close agreement with tlie genetic length of each chromosome. 

■° 2 We also find that the observed number of recombination 

TO E 

J 2 events per individual closely matches what we expect to 

§ £ observe based on genetic map estimates. These results are also 

2! § much better than those produced from Merlin, which 

c 1 shows elevated rates of recombination events across all 

K § chromosomes. On realistic simulated data our method 

(TPR=92.4%, FDR =3.78%) substantially outperforms Mer- 
lin (TPR= 90.57%, FDR= 62.48%). 

An additional benefit of our method is that we can attempt to 
infer recombination events in trios and duos. Methods that 
-| £ explicidy phase trios and duos using the pedigree information 

1 ^ cannot infer recombination events since they infer only the 
g 1 transmitted haplotypes of the parents. We evaluated this approach 
E o via simulation and found that we have «53% power to detect 
■B 3 events at a 5% false discovery rate when the duo is phased in an 
■S o> isolated cohort that may contain close relatives. When close 

2 relatives are removed we have « 34% power to detect events at a 
o § 5% false discovery rate. Cohorts that contain explicit trios and 
Z ^ duos could be phased using methods that explicidy use this 
^ 2! information if desired although the ability to infer recombination 

events would be lost and parents would be estimated as a pair of 
transmitted and untransmitted haplotypes. 

Using our method we are able to demonstrate that the 
recombination events that we infer from otherwise uninformative 
duos and trios can add power to association scans for recombi- 
nation phenotypes. Specifically, at the estabhshed PRDM9 locus 
we are able to show that including these extra recombination 
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phenotype. The field of study of recombination continues to be 
very active [47]. Future studies will look at recombination in 
g g ^ isolated populations in sub-saharan Africa. Our method will allow 

J ° Q GWAS of recombination phenotypes to be carried out in these 

g g -g populations, extracting as much information as possible from the 

0 ^ ^ data. 

c "g I Precisely determining why these large differences in perfor- 

1 S ^ mance between the methods exist is diflicult. We suspect that the 
ra ^ ™ reason resides in the fact that within the SHAPEIT2 method the 
£ g. haplotypes of each individual are explicidy modelled as a mosaic of 

the underlying haplotypes of other individuals [48] . In other words 
the underlying haplotype sharing between two individuals can be 
j2 5 explicidy captured by allowing each individual to 'copy' the 

2 I o haplotypes of another individual over a long stretch of sequence. 
51 E -o BEAGLE takes a difiFerent approach by collapsing the haplotype 
5; ^ ^ information of the sample into a compact graph. Each individual's 
o o haplotypes are then updated within the method conditional upon 
2 £ S this graph. Thus no direct comparison between pairs of individuals 
P < Tj is made and thus the information regarding long stretches of 
S 1 shared sequence between individuals is lost. These comments also 
^ g K ^Pply to HAPTUR which uses a different graph to encode the 
° = g haplotypes of the samples. Our results are consistent across a range 

of cohorts with differing levels of relatedness. Most of these cohorts 
S .& ° are isolated cohorts but the Split and GPC cohorts contain levels 

y i/i CU ^ 

g_ g I S of relatedness that might be expected in a GWAS cohort. 

I g 8 In this paper we have focused exclusively on genetic data 

^ y ^ from human samples but our methods may also be useful in the 

fields of animal and plant genetics where cohorts with high 
levels of relatedness are prevalent [49] . This method may also 
find utility in studies that aim to locate IBD segments between 
individuals. Based on these results we might suggest that a 
strategy of estimating haplotypes with SHAPEIT2 followed by 
application of the GERMLINE method [50] for IBD inference 
from haplotype data may provide an accurate and efficient 
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Figure 6. Distributions of the number of detected crossovers for all cohorts. Only duos that were part of an informative pedigree were 
used. Top: The mean number of recombinations per meiosis (for all informative duos from all cohorts) found for each chromosome against the 
expected number (from the 2002 deCODE map) for paternal meioses (left) and maternal meioses (right). Merlin's values are substantially inflated 
whilst SHAPEIT2's are more consistent with the well known deCODE map genetic lengths. Bottom: Q-Q plots for the observed against expected 
number of recombinations estimated by each method for paternal meioses (left) and maternal meioses (right). For the expected distribution of 
recombination rates, a Poisson distribution using the genetic lengths from the 2002 deCODE Map was used (with rate parameter 42.81 and 25.9 for 
maternal and paternal recombinations respectively). SHAPEIT2's rates are less inflated than those of the Merlin. 
doi:1 0.1 371 /journal.pgen.1 004234.g006 



solution. As cohorts increase in size, or as cohorts are 
combined, the chance of any individual sharing a close relative 
in the cohort increases. Methods such as SHAPEIT2, that can 
accurately leverage this IBD information when estimating 
haplotypes may help to extract the most information from such 
large cohorts. 

Previous research has demonstrated SHAPEIT2's effectiveness 
for phasing cohorts of unrelated individuals, in this paper we 
demonstrate that SHAPEIT2 is in fact effective across the fuU 
spectrum of relatedness. This means that researchers with cohorts 
with any mixture of unrelated, distantiy related or directly 
related individuals have a flexible tool available which can exploit 
all of these degrees of relatedness for very accurate haplotype 
estimates. 

Supporting Information 

Figure SI Computation time in hours for phasing chromosome 
10 for each of the full cohorts run as unrelated (second column of 
Table 4). AU jobs were run on an Intel Xeon CPU E5-2690 
(2.90 GHz) CPU with 256 GB of RAM. This is the total compute 



time for three runs of HAPTUR as specified in the manual. Beagle 
was ran without the lowmem option for better performance. Whilst 
no parallelism was employed here, options exists for exploiting 
multiple processors with varying degrees of difficulty for each piece 
of software. SHAPEIT2 and SLRP have the ability to run on N 
threads resulting in an approximately x reduction in compu- 
tation time. HAPTUR 3 x can be run as three simultaneous 
processes rather than sequentially. Most simply, the genome can 
be partitioned into chunks which are phased separately with the 
results being hgated. 
(TIFF) 

Figure S2 The proportion of heterozygote sites phased by 
Merlin against the size of the pedigree (all cohorts). Note pedigrees 
of the same size may have different structures. For example some 
pedigrees of size three are a parent and two children (as opposed to 
a mother-father-child trio). 
(TIFF) 

Figure S3 Computational performance of Merlin on simulated 
nuclear families of increasing size. Computation time (left) 
and memory usage (right) for Merlin's haplotyping routine 
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Figure 7. Recombination detection accuracy in uninformative duos simulated from chromosome X data in the Val Borbera cohort. 

The green values are for a cohort with nominally unrelated individuals and the orange values are for a cohort that has been filtered such that no 
individuals are closely related (j-<0.35). Left: The ROC curves for recombination detection in uninformative duos for our duo HMIVl using the 
SHAPEIT2 haplotypes. Right: The average number of correct detections against the average posterior probability. Setting a high probability threshold 
ensures a very low false discovery rate. 
doi:1 0.1 371 /journal.pgen.1 004234.g007 



applied to simulated nuclear families with increasing numbers of 
siblings assayed at 16,297 loci on chromosome 10 (Intel Xeon 
CPU E5-2690 2.90 GHz with 256 GB RAM). For pedigrees with 
< 12 non-founders, Merlin's computation time is negligible 
but the method will clearly become intractable for larger 
pedigrees. 
(TIFF) 

Figure S4 Schematic of phasing evaluation pipeline. This figure 
shows a toy example to illustrate the way in which we have used 
mixtures of pedigrees and unrelated samples to assess the 
performance of different methods. The figure shows two families 
of size 3 and 4 respectively and 2 unrelated samples. We used 
Merlin to phase the two families (blue), providing accurate 
haplotypes in the founders. We then ran each of the methods 
SHAPEIT2, Beagle and SLRP on the data from the founders and 
the unrelated samples (pink). The haplotypes estimated in the 
founder individuals was then compared to the Merlin phased 
haplotypes from these samples. 
(TIFF) 

Figure S5 Haplotype correction example using the DuoHMM. 
The Duo HMM Viterbi paths for a three father-child duos from a 
nuclear family (ie. 3 siblings) from the FVG cohort on 
chromosome 10. The four possible IBD states (A, B, C, D) are 
shown using colours pale blue, dark blue, light red and dark red 
respectively (although states A and B do not occur in this example). 
The left panel shows the path prior to any corrections, and the 
right panel after a minimum recombinant correction is applied. 
The second and third sibling initially had a C—^D transition at 
around 25 mb, this is more likely a recombination event in the first 
child hence the parental haplotypes are switched after this point. 
The panel on the right has the corrected haplotypes, the number 
of recombination events required to explain the observed data has 
been reduced. 
(TIFF) 

Figure S6 Evaluation of SHAPEIT2 and SLRP accuracy in IBD 
regions for the Val Borbera cohort. Left: The SHAPEIT2 switch 
error rates (within IBD regions) against the SLRP rates for each 
founder individual in the Val Borbera cohort. Both methods 
achieve low error rates but SHAPEIT2 has lower rates for most 



individuals. Right: The switch error rate for SHAPEIT2 within 
SLRP IBD regions against the rate outside SLRP IBD regions. 
Haplotypes are far more accurate when IBD sharing is present. 
(TIFF) 

Figure S7 The duo HMM Viterbi paths for 50 mother-child 
duos from the CARL cohort on chromosome 10. The four 
possible IBD states (A, B, C, D) are shown using colours pale blue, 
dark blue, light red and dark red respectively. The left and right 
panels show the results of the duo HMM applied to the 
SHAPEIT2 and Beagle haplotypes respectively. Changes between 
a blue and red colour correspond to a or T4 transition, both of 
which imply a switch error in the child. Changes of colour between 
light and dark blue or between light and dark red correspond to T2 
transitions, which correspond to a change on IBD state in the 
parent, and could be caused by a recombination or a switch error 
in the parent. The x-axis shows the sex-averaged genetic distance 
across the chromosome in centiMorgans. 
(TIFF) 

Figure S8 The duo HMM Viterbi paths for 50 father-child duos 
from the CARL cohort on chromosome 10. The four possible IBD 
states (A, B, C, D) are shown using colours pale blue, dark blue, 
light red and dark red respectively. The left and right panels show 
the results of the duo HMM applied to the SHAPEIT2 and Beagle 
haplotypes respectively. Changes between a blue and red colour 
correspond to a or transition, both of which imply a switch 
error in the child. Changes of colour between light and dark blue 
or between light and dark red correspond to T2 transitions, which 
correspond to a change on IBD state in the parent, and could be 
caused by a recombination or a switch error in the parent. The x- 
axis shows the sex-averaged genetic distance across the chromo- 
some in centiMorgans. 
(TIFF) 

Figure S9 The duo HMM Viterbi paths for 50 mother-child 
duos from the FVG cohort on chromosome 10. The four 
possible IBD states (A, B, C, D) are shown using colours pale 
blue, dark blue, light red and dark red respectively. The left 
and right panels show the results of the duo HMM applied to 
the SHAPEIT2 and Beagle haplotypes respectively. Changes 
between a blue and red colour correspond to a Tj, or T4 
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transition, both of which imply a switch error in the child. 
Changes of colour between light and dark blue or between 
light and dark red correspond to T2 transitions, which 
correspond to a change on IBD state in the parent, and could 
be caused by a recombination or a switch error in the parent. 
The X-axis shows the sex-averaged genetic distance across the 
chromosome in centiMorgans. 
(TIFF) 

Figure SIO The duo HMM Viterbi paths for 50 father-child 
duos from the FVG cohort on chromosome 10. The four possible 
IBD states (A, B, C, D) are shown using colours pale blue, dark 
blue, light red and dark red respectively. The left and right panels 
show the results of the duo HMM applied to the SHAPEIT2 and 
Beagle haplotypes respectively. Changes between a blue and red 
colour correspond to a Ja or transition, both of which imply a 
switch error in the child. Changes of colour between light and dark 
blue or between light and dark red correspond to T2 transitions, 
which correspond to a change on IBD state in the parent, and 
could be caused by a recombination or a switch error in the 
parent. The x-axis shows the sex-averaged genetic distance across 
the chromosome in centiMorgans. 
(TIFF) 

Figure Sll The duo HMM Viterbi paths for 50 mother-child 
duos from the CROATIA-Korcula cohort on chromosome 10. 
The four possible IBD states (A, B, C, D) are shown using colours 
pale blue, dark blue, light red and dark red respectively. The left 
and right panels show the rc-sults of tlu- duo IIMM applied to the 
SHAPEIT2 and Beagle haplotypes respectively. Changes between 
a blue and red colour correspond to a or 74 transition, both of 
which imply a switch error in the child. Changes of colour between 
light and dark blue or between light and dark red correspond to T2 
transitions, which correspond to a change on IBD state in the 
parent, and could be caused by a recombination or a switch error 
in the parent. The x-axis shows the sex-averaged genetic distance 
across the chromosome in centiMorgans. 
(TIFF) 

Figure S12 The duo HMM Viterbi paths for 50 father-child 
duos from the CROATIA-Korcula cohort on chromosome 10. 
The four possible IBD states (A, B, C, D) are shown using colours 
pale blue, dark blue, light red and dark red respectively. The left 
and right panels show the results of the duo HMM applied to the 
SHAPEIT2 and Beagle haplotypes respectively. Changes between 
a blue and red colour correspond to a Tj, or Tn transition, both of 
which imply a switch error in the child. Changes of colour between 
light and dark blue or between light and dark red correspond to T2 
transitions, which correspond to a change on IBD state in the 
parent, and could be caused by a recombination or a s^\ itch error 
in the parent. The x-axis shows the sex-averaged genetic distance 
across the chromosome in centiMorgans. 
(TIFF) 

Figure S13 The duo HMM Viterbi paths for 50 mother-child 
duos from the ORCADES cohort on chromosome 10. The four 
possible IBD states (A, B, C, D) are shown using colours pale 
blue, dark blue, light red and dark red respectively. The left and 
right panels show the results of the duo HMM applied to the 
SHAPEIT2 and Beagle haplotypes respectively. Changes be- 
tween a blue and red colour correspond to a or Ti, transition, 
both of which imply a switch error in the chUd. Changes of 
colour between light and dark blue or between light and dark 
red correspond to T2 transitions, which correspond to a change 
on IBD state in the parent, and could be caused by a 
recombination or a switch error in the parent. The x-axis shows 



the sex-averaged genetic distance across the chromosome in 
centiMorgans. 

(TIFF) 

Figure S14 The duo HMM Viterbi paths for 50 father-child 
duos from the ORCADES cohort on chromosome 10. The four 
possible IBD states (A, B, C, D) are shown using colours pale blue, 
dark blue, light red and dark red respectively. The left and right 
panels show the results of the duo HMM applied to the 
SHAPEIT2 and Beagle haplotypes respectively. Changes between 
a blue and red colour correspond to a Tj, or Tn transition, both of 
which imply a switch error in the child. Changes of colour between 
light and dark blue or between light and dark red correspond to T2 
transitions, which correspond to a change on IBD state in the 
parent, and could be caused by a recombination or a switch error 
in the parent. The x-axis shows the sex-averaged genetic distance 
across the chromosome in centiMorgans. 
(TIFF) 

Figure S15 The duo HMM Viterbi paths for 50 mother-child 
duos from the CROATIA-Split cohort on chromosome 10. The 
four possible IBD states (A, B, C, D) are shown using colours pale 
blue, dark blue, light red and dark red respectively. The left and 
right panels show the results of the duo HMM applied to the 
SHAPEIT2 and Beagle haplotypes respectively. Changes between 
a blue and red colour correspond to a or 74 transition, both of 
which imply a switch error in the child. Changes of colour between 
light and dark blue or between light and dark red correspond to T2 
transitions, which correspond to a change on IBD state in the 
parent, and could be caused by a recombination or a switch error 
in the parent. The x-axis shows the sc'x-a\-eraged genetic distance 
across the chromosome in centiMorgans. 
(TIFF) 

Figure S16 The duo HMM Viterbi paths for 50 fether-child 
duos from the CROATIA-Split cohort on chromosome 10. The 
four possible IBD states (A, B, C, D) are shown using colours pale 
blue, dark blue, light red and dark red respectively. The left and 
right panels show the results of the duo HMM applied to the 
SHAPEIT2 and Beagle haplotypes respectively. Changes between 
a blue and red colour correspond to a 73 or transition, both of 
which imply a switch error in the child. Changes of colour between 
light and dark blue or between hght and dark red correspond to T2 
transitions, which correspond to a change on IBD state in the 
parent, and could be caused by a recombination or a switch error 
in the parent. The x-axis shows the sex-averaged genetic distance 
across the chromosome in centiMorgans. 
(TIFF) 

Figure S17 The duo HMM Viterbi paths for 50 mother-child 
duos from the Val Borbera cohort on chromosome 10. The four 
possible IBD states (A, B, C, D) are shown using colours pale blue, 
dark blue, light red and dark red respectively. The left and right 
panels show the results of the duo HMM applied to the 
SHAPEIT2 and Beagle haplotypes respectively. Changes between 
a blue and red colour correspond to a T3 or To, transition, both of 
which imply a switch error in the child. Changes of colour between 
light and dark blue or between light and dark red correspond to T2 
transitions, which correspond to a change on IBD state in the 
parent, and could be caused by a recombination or a switch error 
in the parent. The x-axis shows the sex-averaged genetic distance 
across the chromosome in centiMorgans. 
(TIFF) 

Figure S18 The duo HMM Viterbi paths for 50 fether-child 
duos from the Val Borbera cohort on chromosome 10. The four 
possible IBD states (A, B, C, D) are shown using colours pale blue, 
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dark blue, light red and dark red respectively. The left and right 
panels show the results of the duo HMM applied to the 
SHAPEIT2 and Beagle haplotypes respectively. Changes between 
a blue and red colour correspond to a or Ti, transition, both of 
which imply a switch error in the child. Changes of colour between 
light and dark blue or between light and dark red correspond to T2 
transitions, which correspond to a change on IBD state in the 
parent, and could be caused by a recombination or a switch error 
in the parent. The x-axis shows the sex-averaged genetic distance 
across the chromosome in centiMorgans. 
(TIFF) 

Figure S19 The duo HMM Viterbi paths for .50 mother-child 
duos from the CROATIA- Vis cohort on chromosome 10. The 
four possible IBD states (A, B, C, D) are shown using colours 
pale blue, dark blue, light red and dark red respectively. The 
left and right panels show the results of the duo HMM applied 
to the SHAPEIT2 and Beagle haplotypes respectively. 
Changes between a blue and red colour correspond to a Tt, 
or Tn transition, both of which imply a switch error in the 
child. Changes of colour between light and dark blue or 
between light and dark red correspond to T2 transitions, which 
correspond to a change on IBD state in the parent, and could 
be caused by a recombination or a switch error in the parent. 
The X-axis shows the sex-averaged genetic distance across the 
chromosome in centiMorgans. 
(TIFF) 

Figure S20 The duo HMM Viterbi paths for ,50 father-child 
duos from the CROATIA- Vis cohort on chromosome 10. The 
four possible IBD states (A, B, C, D) are shown using colours 
pale blue, dark blue, light red and dark red respectively. The 
left and right panels show the results of the duo HMM applied 
to the SHAPEIT2 and Beagle haplotypes respectively. 
Changes between a blue and red colour correspond to a 73 
or 74 transition, both of which imply a switch error in the 
child. Changes of colour between light and dark blue or 
between light and dark red correspond to T2 transitions, which 
correspond to a change on IBD state in the parent, and could 
be caused by a recombination or a switch error in the parent. 
The X-axis shows the sex-averaged genetic distance across the 
chromosome in centiMorgans. 
(TIFF) 

Figure S21 The duo HMM Viterbi paths for 50 mother-child 
duos from the GPC cohort on chromosome 10. The four possible 
IBD states (A, B, C, D) are shown using colours pale blue, dark 
blue, light red and dark red respectively. The left and right panels 
show the results of the duo HMM applied to the SHAPEIT2 and 
Beagle haplotypes respectively. Changes between a blue and red 
colour correspond to a T-i or transition, both of which imply a 
switch error in the child. Changes of colour between light and dark 
blue or between hght and dark red correspond to T2 transitions, 
which correspond to a change on IBD state in the parent, and 
could be caused by a recombination or a switch error in the 
parent. The x-axis shows the sex-averaged genetic distance across 
the chromosome in centiMorgans. 
(TIFF) 

Figure S22 The duo HMM Viterbi paths for 50 father-child 
duos from the GPC cohort on chromosome 10. The four 
possible IBD states (A, B, C, D) are shown using colours pale 
blue, dark blue, light red and dark red respectively. The left 
and right panels show the results of the duo HMM applied to 
the SHAPEIT2 and Beagle haplotypes respectively. Changes 
between a blue and red colour correspond to a or 74 



transition, both of which imply a switch error in the child. 
Changes of colour between light and dark blue or between 
light and dark red correspond to T2 transitions, which 

correspond to a change on IBD state in the parent, and could 
be caused by a recombination or a switch error in the parent. 
The X-axis shows the sex-averaged genetic distance across the 
chromosome in centiMorgans. 
(TIFF) 

Figure S23 Detection of recombination in simulated extended 
pedigrees and comparison of SHAPEIT2 and Merlin haplotype 
accuracy for realistic data (top) and ideal data (bottom). Left: True 
positive rate (TPR) versus false discovery rate (FDR) for 
recombination detection using SHAPEIT2+duoHMM (red) versus 
Merlin (blue). Merlin detects 90.57% of crossovers with a 
substantial FDR of 62.48% whilst SHAPEIT2+duoHMM detects 
92.40% of events with an FDR of 3.78% (the red point at 
P(R=1)>0.5). On ideal data SHAPEIT2+duoHMM achieve a 
95.75% TPR and 0.7 1 % FDR and Merlin has 95.66% and 3.48% 
respectively. Right: Switch error for SHAPEIT2 (DuoHMM 
corrected) versus Merlin. The black point is the mean for each 
method. Merlin had 0.215% (0.021% ideal scenario) switch error 
while SHAPEIT2 had a rate of 0.033% (0.005% ideal scenario). 
(TIFF) 

Figure S24 Switch error results on simulated data for extended 

pedigrees. Points are coloured according to what family 
information was used by Beagle in duo/ trio mode; red meaning 
an individual could not be included in a duo or trio. Left: Beagle 
duo/trio switch error rate (0.269% average SE) against switch 
error rate for SHAPEIT2 when all individuals are treated as 
unrelated (0.104% average SE). Centre: After applying the 
duoHMM haplotype corrections to SHAPEIT2 (0.065'yo average 
SE). Right: Switch error after masking genotypes flagged as 
erroneous by the SHAPEIT2+duoHMM, Beagle duo/trio 
phasing is reduced to 0.231% and SHAPEIT2-l-duoHMM to 
0.034%. 
(TIFF) 

Figure S25 Genome-wide number of recombinations found 

per individual for SHAPEIT versus Merlin for all informative 
meioses in real data sets. The number of recombinations found 
by SHAPEIT2 against Merlin for each of 661 meioses (all 
informative duos from all cohorts). Maternal are in green and 
paternal purple. Triangles represent a truncated value. While 
there is correlation between the two methods, Merlin more 
frequently finds an implausible number of recombinations. 
(TIFF) 

Figure S26 Carlantino cohort recombination distributions. Q; 
Q_ plots for the obserx^ed against expected number of recombi- 
nations estimated by each method for paternal meioses (left) and 
maternal meioses (right), only duos that were part of an 
informative pedigree were used. For the expected distribution 
of recombination rates, a Poisson distribution using the genetic 
lengths from the 2002 deCODE Map was used (with rate 
parameter 42.81 and 25.9 for maternal and paternal recombi- 
nations respectively.) 
(TIFF) 

Figure S27 FVG cohort - recombination distributions. QjQ, 
plots for the observed against expected number of recombinations 
estimated by each method for paternal meioses (left) and maternal 

meioses (right), only duos that were part of an informative pedigree 
were used. For the expected distribution of recombination rates, a 
Poisson distribution using the genetic lengths from the 2002 
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deCODE Map was used (with rate parameter 42.81 and 25.9 for 
maternal and paternal recombinations respectively.) 

(TIFF) 

Figure S28 Korcula cohort - recombination distributions. QrQ, 
plots for the observed against expected number of recombinations 
estimated by each method for paternal meioses (left) and 
maternal mciosc's (right), only duos that were part of an 
informative pedigree were used. For the expected distribution 
of recombination rates, a Poisson distribution using the genetic 
lengths from the 2002 deCODE Map was used (with rate 
parameter 42.81 and 25.9 for maternal and paternal recombi- 
nations respectively.) 
(TIFF) 

Figure S29 ORCADES cohort - recombination distributions. 

Q;Q- plots for the obser\'ed against expected number of 
recombinations estimated by each method for paternal meioses 
(left) and maternal meioses (right), only duos that were part of an 
informative pedigree were used. For the expected distribution of 
recombination rates, a Poisson distribution using the genetic 
lengths from the 2002 deCODE Map was used (with rate 
parameter 42.81 and 25.9 for maternal and paternal recombina- 
tions respectively.) 
(TIFF) 

Figure S30 Valborbera cohort - recombination distributions. Q; 
Q, plots for the observed against expected number of recombina- 
tions estimated by each method for paternal meioses (left) and 
maternal meioses (right), only duos that were part of an 
informative pedigree were used. For the expected distribution of 
recombination rates, a Poisson distribution using the genetic 
lengths from the 2002 deCODE Map was used (with rate 
parameter 42.81 and 25.9 for maternal and paternal recombina- 
tions respectively.) 
(TIFF) 

Figure S31 Vis cohort - recombination distributions. Q;Q_ plots 
for the observed against expected number of recombinations 
estimated by each method for paternal meioses (left) and 
maternal meioses (right), only duos that were part of an 
informative pedigree were used. For the expected distribution 
of recombination rates, a Poisson distribution using the genetic 
lengths from the 2002 deCODE Map was used (with rate 
parameter 42.81 and 25.9 for maternal and paternal recombi- 
nations respectively.) 
(TIFF) 

Figure S32 GPC cohort - recombination distributions. QrQ, 

plots for the observed against expected number of recombinations 
estimated by each method for paternal meioses (left) and 
maternal meioses (right), only duos that were part of an 
informative pedigree were used. For the expected distribution 
of recombination rates, a Poisson distribution using the genetic 
lengths from the 2002 deCODE Map was used (with rate 
parameter 42.81 and 25.9 for maternal and paternal recombi- 
nations respectively.) 
(TIFF) 

Figure S33 Recombination rates in regions around crossover 
detections by each method. Average recombination rate (from 
the HapMap LD map) centred on the location (the average of 
the two flanking heterozygous positions) of 34344 crossovers 
found by both SHAPEIT2 and Merlin (black), 4127 cross- 
overs found only by SHAPEIT2 (red) and 32580 cross- 
overs found only by Merlin (blue). Events found only by 
Merlin are in regions with less recombination on average than 



those found by by SHAPEIT2 suggesting a higher false 

detection rate. 

(TIFF) 

Figure S34 Association testing between PRDM9 region and 
hotspot usage phenotype for European cohorts. Left: The 

— logjoP-values for the European meta-analysis for association 
between PRDM9 variants and the 'hotspot usage' phenotype. We 
used 6 1 8 informative and 466 uninformative parents in this analysis. 
Right: The — logigP-values of this analysis plotted against the 

— logigP- values when only the 618 informative parents are used. 
The additional samples yield a modest increase in power. 
(TIFF) 

Table SI Frequencies of dilferent pedigree sizes within each of 
the cohorts. Pedigrees of size 1 are individuals not part of an 

explicit pedigree. "Unrelated" is the sum of the number of 
pedigree founders and the number of individuals not in any 
pedigree. Note due to unspecified relationships, some of these 
individuals may stiU be closely related. 
(PDF) 

Table S2 The genotype confusion matrix used to simulate 
genotyping errors in our simulation studies. This is based on the 
discordance between lUumina Omni2.5S and Affymetrix Axiom 
chips on 1000 Genomes individuals. We took the Axiom 
genotypes as "truth" but halved the discordance and normalised 
the diagonal appropriately (the missing rate was left unchanged). 
This is to account for discordance that was actually due to Axiom 
chip errors. 
(PDF) 

Table S3 The average number of corrections applied to 
haplotypes for each method. 'P' and 'M' denotes when we correct 
a child's haplotypes using information from the parental (P) or 
maternal (M) haplotypes, to ensure consistent gene flow. 'C 
denotes when multiple children were used to find the minimum 
recombinant parental haplotypes. Very few corrections are 
required for the SHAPEIT2 haplotypes compared to Beagle and 
HAPI-UR, this is also evident from the switch error improvements 
shown in Table 2. Cohort abbreviations: CARL - Carlantino, 
FVG - Friuli Venezia Giulia, GPC - Ugandan General Population 
Cohort, KOR - CROATIA-Korcula, ORC - Orkney Complex 
Disease Study, SPL - CROATL\-Split, VB - Val Borbera. VIS - 
CROATIA-Vis. 
(PDF) 
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